Data Filtering

I used the clean dataset produced in the “Colorado_Fulldataset” document in order to analyze only the general public's response on Twitter, since that dataset already excludes tweets sent by official agencies and bots. The total number of tweets in the dataset is 3858.

Tweets sent during the Flood and Immediate Aftermath phases of the disaster were kept, which excludes 45% of the tweets. That is, we will use 2132 tweets of the original 3858.
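A minimal sketch of this filtering step (the data frame and its `phase` column are hypothetical names used for illustration):

```r
# Toy example: keep only tweets from the two phases of interest.
# `tweets` and its `phase` column are assumed names, not the real objects.
tweets <- data.frame(
  text  = c("water rising", "creek flooding", "cleanup", "donations"),
  phase = c("Flood", "Immediate Aftermath", "Recovery", "Recovery"),
  stringsAsFactors = FALSE
)
kept <- tweets[tweets$phase %in% c("Flood", "Immediate Aftermath"), ]
nrow(kept)  # 2 of the 4 toy rows survive the filter
```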

Before any spatial analysis or plotting, the data was projected to the North America Lambert Conformal Conic projection.

## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=lcc +lat_1=20 +lat_2=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
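A sketch of that projection step, assuming the tweets are stored as an `sf` object in WGS84 (the object name is an assumption; the PROJ string is copied from the CRS printout above, and sp's `spTransform` would work similarly):

```r
library(sf)

# PROJ string taken from the Coordinate Reference System printout above
lcc <- paste("+proj=lcc +lat_1=20 +lat_2=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0",
             "+ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs")

# tweets_sf is a hypothetical sf object with point geometries in EPSG:4326
tweets_p <- st_transform(tweets_sf, crs = lcc)
```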


Subsetting by Spatial Clustering

One of the goals of this study is to confirm whether spatially clustered tweets can serve as a proxy for reports from affected areas. So I will repeat the same analysis done for the whole dataset, but now considering only tweets that belong to clusters after the spatial clustering process. A hierarchical implementation of DBSCAN (hdbscan) was used for the spatial clustering. hdbscan requires a minimum number of points for a group to be identified as a cluster. Setting that number to any value between 159 and 225 identified clusters in Colorado, so from that range we picked the value that retains the maximum number of tweets: 160.
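The clustering step can be sketched with the `dbscan` package (object names are assumptions; `minPts = 160` follows the text above):

```r
library(dbscan)
library(sf)

# Extract projected coordinates (in metres) from the hypothetical sf object
xy <- st_coordinates(tweets_p)

# hdbscan labels noise points with cluster id 0
cl <- hdbscan(xy, minPts = 160)
tweets_clustered <- tweets_p[cl$cluster != 0, ]
```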

Most common words (Within Spatial Clusters)

After the spatial filtering, keeping only tweets sent during the Flood stage, only 30% of the total tweets were retained: 1172 of the 3858.

A quick view of the most common words in the whole dataset:

## # A tibble: 3,070 x 2
##    word             n
##    <chr>        <int>
##  1 boulder        734
##  2 boulderflood   355
##  3 cowx           158
##  4 flood          115
##  5 coflood        113
##  6 colorado       105
##  7 rain            92
##  8 flooding        79
##  9 creek           76
## 10 amp             61
## # … with 3,060 more rows
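A count like the one above can be produced with tidytext (a sketch; the `text` column name and the input object are assumptions):

```r
library(dplyr)
library(tidytext)

tweets_clustered %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(word, sort = TRUE)
```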

Again, since “boulder” is the most common word and would have a big effect on our topic modelling, it was removed from the dataset. The term “boulderflood” was also excluded because it was very common and used neutrally in all four stages. After excluding these two terms, the new list of common words looks as follows:

## # A tibble: 3,068 x 2
##    word         n
##    <chr>    <int>
##  1 cowx       158
##  2 flood      115
##  3 coflood    113
##  4 colorado   105
##  5 rain        92
##  6 flooding    79
##  7 creek       76
##  8 amp         61
##  9 rt          55
## 10 water       52
## # … with 3,058 more rows
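Dropping the two dominant terms amounts to a simple filter (toy data shown here; the real counts come from the tables above):

```r
# Toy word-count table mirroring the tibbles above
word_counts <- data.frame(
  word = c("boulder", "boulderflood", "cowx", "flood"),
  n    = c(734, 355, 158, 115),
  stringsAsFactors = FALSE
)
word_counts <- word_counts[!word_counts$word %in% c("boulder", "boulderflood"), ]
word_counts$word  # "cowx" and "flood" remain
```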


Topic Modeling

Again, after experimenting with different numbers of topics, I decided to train a topic model with 15 topics. From 16 topics onward, topics started to look very similar (sharing the same bag of words). Here is a summary of the results after this process:
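The fit itself can be sketched with the `topicmodels` package (the document-term matrix name and the seed are assumptions):

```r
library(topicmodels)

# tweets_dtm is a hypothetical document-term matrix built from the filtered tweets
lda_15 <- LDA(tweets_dtm, k = 15, control = list(seed = 1234))
terms(lda_15, 10)  # top ten terms per topic
```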

Mapping topic 12 to see spatial distribution:

library(mapview)

mapview(tweet_and_topic_geo, zcol = "topic", layer.name = "topic", burst = TRUE) + 
  mapview(affected_counties_p)